fix(query-language): preserve grouped filters in regex mode #1138

Merged
brendan-kellam merged 2 commits into sourcebot-dev:main from brianphillips:paren-grouping-parsing
Apr 20, 2026

Conversation

@brianphillips
Contributor

brianphillips commented on Apr 20, 2026

Summary

  • preserve query grouping in regex mode when parentheses wrap query syntax instead of raw regex atoms
  • keep bare regex groups like (foo|bar) parsed as single regex terms
  • add regression coverage for grouped prefix filters and for parenthesized regex OR groups behind filters
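The distinction in the first two bullets can be sketched as a standalone heuristic. This is an illustrative simplification, not the tokenizer's actual logic; the function name and prefix list are hypothetical:

```typescript
// Hypothetical sketch: decide whether a parenthesized body should be treated
// as query grouping rather than a raw regex atom. The real tokenizer works on
// stream offsets and a fuller prefix set; this only mirrors the idea.
const QUERY_PREFIXES = ["repo:", "file:", "lang:", "rev:", "case:"];

function looksLikeQueryGrouping(body: string): boolean {
    const trimmed = body.trim();
    // Multiple whitespace-separated parts (e.g. "repo:a or repo:b")
    // read as query grouping, not a single regex atom.
    if (trimmed.split(/\s+/).length > 1) {
        return true;
    }
    // A single bare token like "foo|bar" stays a regex atom unless it
    // starts with a known filter prefix, a negation, or a quote.
    return (
        QUERY_PREFIXES.some((p) => trimmed.startsWith(p)) ||
        trimmed.startsWith("-") ||
        trimmed.startsWith('"')
    );
}

console.log(looksLikeQueryGrouping("foo|bar"));          // false: bare regex group
console.log(looksLikeQueryGrouping("repo:a or repo:b")); // true: query grouping
```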

Testing

  • ../../node_modules/.bin/vitest --run (from packages/queryLanguage)

🤖 Agent Plan

  1. Trace how isRegexEnabled changes query parsing and identify where field-specific filters are converted into regex queries.
  2. Reproduce the failing parse shape and confirm regex mode is treating grouped filter parentheses as part of a regex term instead of query grouping.
  3. Update the query-language tokenizer so regex mode distinguishes raw regex parentheses from query-like parenthesized groups, including negated groups.
  4. Add regression tests for grouped prefix filters, grouped regex OR expressions behind filters, and the nested filter plus (regex or regex) grouping discussed in this thread.
  5. Run the focused packages/queryLanguage Vitest suite and verify the failing queries parse correctly.
  6. Exclude unrelated repository state changes, including the vendor/zoekt submodule rewind, from the PR.

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Corrected regex mode search parsing to properly handle parenthesized filter groups. Query-style grouping syntax now works correctly when regex mode is enabled, allowing users to combine multiple filtered conditions with regex patterns for more precise and complex searches.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 20, 2026

Walkthrough

The PR fixes regex mode query parsing by introducing offset-aware helpers and isRegexQueryGroupingAt() to distinguish query-grouping parentheses from regex atom parentheses. The tokenizer now selectively emits parenthesis tokens and refines negation logic based on whether parenthesized content represents query grouping syntax.

Changes

Documentation & Comments (CHANGELOG.md, packages/web/src/features/search/parser.ts):
Added an entry documenting the regex search parsing fix and updated comments to clarify the distinction between bare regex parentheses and query-like parenthesized expressions.

Tokenizer Logic (packages/queryLanguage/src/tokens.ts):
Added offset-aware helpers (matchesStringAt, startsWithPrefixAt, skipWhitespace, isEscapedAt), introduced findMatchingCloseParenOffset(), findMatchingOpenParenOffset(), and isRegexQueryGroupingAt() for context-aware paren matching. Updated parenToken, closeParenToken, and wordToken behavior in Dialect_regex to emit parens only when they represent query grouping; simplified the negateToken condition.

Test Coverage (packages/queryLanguage/test/regex.txt):
Added four new test cases covering parenthesized prefix/filter groups with the or keyword, verifying correct AST generation for combinations like ParenExpr(OrExpr(...)) and NegateExpr(ParenExpr(OrExpr(...))).
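The close-paren matcher named above can be approximated over a plain string. The real helper operates on Lezer InputStream offsets; this is an illustrative sketch of the escape-aware, depth-counting forward scan, not the actual implementation:

```typescript
// Sketch of forward scanning for the paren that closes input[openIndex],
// counting nesting depth and skipping backslash-escaped characters so that
// regex literals like "\(" and "\)" do not affect the depth.
function findMatchingCloseParen(input: string, openIndex: number): number | null {
    if (input[openIndex] !== "(") {
        return null;
    }
    let depth = 0;
    for (let i = openIndex; i < input.length; i++) {
        const ch = input[i];
        if (ch === "\\") {
            i++; // the escaped character is a literal, not a paren
            continue;
        }
        if (ch === "(") {
            depth++;
        } else if (ch === ")") {
            depth--;
            if (depth === 0) return i;
        }
    }
    return null; // unbalanced: no matching close paren
}

console.log(findMatchingCloseParen("(a(b)c)", 0)); // 6: outermost close paren
```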

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


🚥 Pre-merge checks (3 passed)
  • Description check: ✅ Passed. Skipped because CodeRabbit’s high-level summary is enabled.
  • Title check: ✅ Passed. The title clearly and specifically describes the main fix: preserving grouped filters in regex mode, the core purpose of the changeset.
  • Docstring coverage: ✅ Passed. Coverage is 100.00%, above the required 80.00% threshold.



@coderabbitai
Contributor

coderabbitai Bot left a comment


🧹 Nitpick comments (3)
packages/queryLanguage/src/tokens.ts (2)

198-298: Consider early-exiting once we know the inside isn't a single token.

isRegexQueryGroupingAt enumerates every top-level token into topLevelTokens, but the only facts that matter afterwards are (a) whether there is exactly one top-level token, and (b) the starting offset of that token (for firstCh / prefix / -/"/( decisions). For a long paren body that already contains two or more top-level tokens, you can return true as soon as the second one is seen instead of finishing the scan.

♻️ Suggested simplification
-    const topLevelTokens: Array<{ start: number; end: number }> = [];
-
-    while (offset < closeOffset) {
-        const tokenStart = offset;
-        ...
-        topLevelTokens.push({ start: tokenStart, end: offset });
-        offset = skipWhitespace(input, offset);
-    }
-
-    if (topLevelTokens.length !== 1) {
-        return true;
-    }
-
-    const [{ start, end }] = topLevelTokens;
+    // Scan the first top-level token and bail out to "grouping" as soon as
+    // we discover a second one — we don't need the token list.
+    let firstTokenStart = -1;
+    let firstTokenEnd = -1;
+    let seenTokens = 0;
+    while (offset < closeOffset) {
+        const tokenStart = offset;
+        // ...same inner scan as today...
+        if (seenTokens === 0) {
+            firstTokenStart = tokenStart;
+            firstTokenEnd = offset;
+        }
+        seenTokens++;
+        if (seenTokens > 1) {
+            return true;
+        }
+        offset = skipWhitespace(input, offset);
+    }
+    if (seenTokens === 0) {
+        return true; // whitespace-only body
+    }
+    const start = firstTokenStart;
+    const end = firstTokenEnd;

Not a correctness issue — purely a small readability / worst-case improvement for deeply parenthesized regex bodies.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/queryLanguage/src/tokens.ts` around lines 198 - 298, The function
isRegexQueryGroupingAt currently collects all top-level tokens into
topLevelTokens but only needs to know if there's exactly one and the start of
that one; change the loop so it tracks the first token's start (e.g.,
firstTokenStart) and a tokenCount instead of pushing into topLevelTokens, and as
soon as tokenCount becomes 2 return true immediately (early-exit) to avoid
scanning the rest of the paren body; keep the same logic afterward that uses the
single token's start (firstTokenStart) and end to check startsWithPrefixAt, '-'
handling, QUOTE, and nested OPEN_PAREN recursion.

300-335: Consider bounding the backward scan in findMatchingOpenParenOffset.

Unlike hasUnmatchedOpenParen (capped at 1000 chars backward), this helper walks the input backward without any upper bound. That wouldn't be a problem on its own, but in regex-mode wordToken it is invoked on every ) encountered while consuming a word (lines 536–541), and it is also the first thing closeParenToken does in regex mode. On pathological inputs (e.g., a long regex word containing many ) characters with no matching (), each ) triggers a full backward scan of everything consumed so far — an easy path to O(n²) tokenization.

Two options:

  1. Match the 1000-char safety limit used in hasUnmatchedOpenParen so the scan is bounded. This is the smallest change and keeps behavior consistent within this file.
  2. Better: memoize the result (or keep a running paren stack) in the regex-mode wordToken loop so the information is computed once up front instead of recomputed per ).
♻️ Minimal bounded-scan fix
 function findMatchingOpenParenOffset(input: InputStream): number | null {
     if (input.next !== CLOSE_PAREN) {
         return null;
     }

     let offset = -1;
     let depth = 1;

-    while (true) {
+    const MIN_OFFSET = -1000;
+    while (offset >= MIN_OFFSET) {
         const ch = input.peek(offset);
         ...
     }
+    return null;
 }
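The second option above (a running paren stack computed once up front) could look roughly like the following. This is a sketch over a plain string under the same assumptions as the review comment, not code from the PR:

```typescript
// Precompute matching paren offsets in one left-to-right pass so that each
// ")" lookup during tokenization is O(1) instead of a backward rescan.
// Backslash-escaped characters are skipped so "\(" and "\)" stay literals.
function matchParens(input: string): Map<number, number> {
    const closeToOpen = new Map<number, number>(); // ")" index -> "(" index
    const stack: number[] = [];
    for (let i = 0; i < input.length; i++) {
        const ch = input[i];
        if (ch === "\\") {
            i++; // skip the escaped character
            continue;
        }
        if (ch === "(") {
            stack.push(i);
        } else if (ch === ")" && stack.length > 0) {
            closeToOpen.set(i, stack.pop()!);
        }
    }
    return closeToOpen; // ")" positions without an entry are unmatched
}

console.log(matchParens("a(b)c").get(3)); // 1: the "(" matching the ")" at index 3
```

Adapting this to an incremental InputStream would need care around buffering, which is why the bounded-scan fix above is the smaller change.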
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/queryLanguage/src/tokens.ts` around lines 300 - 335, The backward
scan in findMatchingOpenParenOffset can run unbounded and cause O(n²) behavior;
bound the scan like hasUnmatchedOpenParen by stopping after a fixed limit (e.g.,
MAX_UNMATCHED_SCAN = 1000) and returning null if exceeded. Edit
findMatchingOpenParenOffset to track how many positions it has moved (or check
offset <= -MAX_UNMATCHED_SCAN) and break/return null when the limit is reached;
keep the rest of the escape and depth logic unchanged so callers like wordToken
and closeParenToken get a safe, bounded result.
packages/queryLanguage/test/regex.txt (1)

56-74: Good regression coverage; minor inconsistency worth noting.

The new fixtures line up with the grammar (ParenExpr { openParen query? closeParen }, NegateExpr { negate (PrefixExpr | ParenExpr) }) and exercise the interesting regex-mode grouping paths (top-level OR group, AND with prefix + group, multi-filter AND + group with internal spaces, negated group).

Only nit: the repo: regex on line 62 encodes - as \x2d while line 67 uses a plain -. Both are equivalent regex atoms, and both should travel through the tokenizer as part of the same Term/prefix word, so the divergence isn't strictly needed — unless the intent was explicitly to cover the hex-escape path. Consider either aligning on plain - or adding a one-line comment above line 62 stating that \x2d is there to exercise hex escapes inside a regex prefix value.
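For reference, the equivalence the nit relies on (the fixture repo patterns themselves are quoted from the test file; the check below is only a demonstration):

```typescript
// \x2d is the hex escape for "-", so both patterns denote the same atom
// and match the same repository name.
const hexEscaped = new RegExp("^github\\.com/sourcebot\\x2ddev/sourcebot$");
const plain = new RegExp("^github\\.com/sourcebot-dev/sourcebot$");
const name = "github.com/sourcebot-dev/sourcebot";
console.log(hexEscaped.test(name), plain.test(name)); // true true
```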

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/queryLanguage/test/regex.txt` around lines 56 - 74, The fixture
using repo:^github\.com/sourcebot\x2ddev/sourcebot$ is inconsistent with the
later fixture that uses a plain hyphen; either change the hex-escaped \x2d to a
plain - so both regex-prefix fixtures use the same form, or add a one-line
comment above the repo:^github\.com/sourcebot\x2ddev/sourcebot$ fixture stating
that \x2d is intentionally used to exercise hex-escape handling in the tokenizer
(the surrounding grammar symbols to locate the case are the repo: prefix and the
ParenExpr/NegateExpr fixtures such as
Program(AndExpr(PrefixExpr(RepoExpr),ParenExpr(...)))).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 10c94d85-06a4-4244-80bb-961710e07771

📥 Commits

Reviewing files that changed from the base of the PR and between b2941c4 and 5dbf6b2.

📒 Files selected for processing (4)
  • CHANGELOG.md
  • packages/queryLanguage/src/tokens.ts
  • packages/queryLanguage/test/regex.txt
  • packages/web/src/features/search/parser.ts

@brendan-kellam
Contributor

brendan-kellam left a comment


LGTM thx!

brendan-kellam merged commit 8b56ea8 into sourcebot-dev:main on Apr 20, 2026
6 checks passed

2 participants